Heterogeneous Learner for Web Page Classification
Abstract
Classifying a target class of Web pages (e.g., personal homepages, resume pages) has been a problem of ongoing interest. Typical machine learning algorithms for this problem require two classes of training data: positive and negative examples. In Web page classification, however, gathering an unbiased sample of negative examples is difficult. We propose a heterogeneous learning framework for classifying Web pages that (1) eliminates the need for negative training data and (2) increases classification accuracy. The framework combines two heterogeneous learners that complement each other, a decision list and a linear separator: together they remove the need for negative training data in the training phase and increase accuracy in the testing phase. Our results show that the framework achieves high accuracy without negative training data, and that it improves on the accuracy of linear separators by reducing errors on “low-margin data”. That is, it classifies more accurately while requiring less human effort in training.
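The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes pages are bags of words, that a pool of unlabeled pages is available in place of negative examples, that a single keyword rule stands in for the decision list, and that a perceptron stands in for the linear separator. All function names, the `ratio` and `low_margin` parameters, and the toy data are hypothetical.

```python
from collections import Counter

def strong_positive_features(positives, unlabeled, ratio=2):
    """Words whose document rate in positives exceeds `ratio` times their rate in unlabeled pages."""
    pos_df = Counter(f for doc in positives for f in set(doc))
    unl_df = Counter(f for doc in unlabeled for f in set(doc))
    # Cross-multiplied integer comparison avoids floating-point ties.
    return {f for f, c in pos_df.items()
            if c * len(unlabeled) > ratio * unl_df[f] * len(positives)}

def rule_vote(doc, strong):
    """Stand-in for a decision list: +1 if any strong positive word is present, else -1."""
    return 1 if strong & set(doc) else -1

def train_perceptron(examples, epochs=20):
    """Stand-in for a linear separator; `examples` is a list of (doc, label) pairs."""
    w, b = Counter(), 0
    for _ in range(epochs):
        for doc, y in examples:
            score = b + sum(w[f] for f in set(doc))
            if y * score <= 0:          # misclassified or on the boundary: update
                for f in set(doc):
                    w[f] += y
                b += y
    return w, b

def classify(doc, w, b, strong, low_margin=1):
    """Use the linear separator, but defer to the rule when the margin is small."""
    m = b + sum(w[f] for f in set(doc))
    if abs(m) < low_margin:
        return rule_vote(doc, strong)
    return 1 if m > 0 else -1

# Toy data (hypothetical): labeled positives plus an unlabeled pool, no labeled negatives.
positives = [["resume", "education", "skills"],
             ["resume", "experience", "skills"],
             ["cv", "education", "experience"]]
unlabeled = [["resume", "skills", "education"], ["news", "sports", "score"],
             ["shop", "price", "cart"], ["news", "weather", "forecast"],
             ["cv", "experience", "skills"], ["shop", "sale", "price"]]

strong = strong_positive_features(positives, unlabeled)
# Unlabeled pages rejected by the rule are treated as likely negatives.
likely_negatives = [d for d in unlabeled if rule_vote(d, strong) == -1]
w, b = train_perceptron([(d, 1) for d in positives] + [(d, -1) for d in likely_negatives])

print(classify(["resume", "education", "projects"], w, b, strong))
print(classify(["resume", "news"], w, b, strong))
```

In this sketch the rule does two jobs: it picks likely negatives out of the unlabeled pool for training, and it breaks ties at test time for pages that fall close to the separator's decision boundary.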
Similar resources
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
Combining ILP with Semi-supervised Learning for Web Page Categorization
This paper presents a semi-supervised learning algorithm called Iterative-Cross Training (ICT) to solve the Web pages classification problems. We apply Inductive logic programming (ILP) as a strong learner in ICT. The objective of this research is to evaluate the potential of the strong learner in order to boost the performance of the weak learner of ICT. We compare the result with the supervis...
Web as a textbook: Curating Targeted Learning Paths through the Heterogeneous Learning Resources on the Web
A growing subset of the web today is aimed at teaching and explaining technical concepts with varying degrees of detail and to a broad range of target audiences. Content such as tutorials, blog articles and lecture notes is becoming more prevalent in many technical disciplines and provides up-to-date technical coverage with widely different levels of prerequisite assumptions on the part of the ...
A Redundant Covering Algorithm Applied to Text Classification
Covering algorithms for learning rule sets tend toward learning concise rule sets based on the training data. This bias may not be appropriate in the domain of text classification due to the large number of informative features these domains typically contain. We present a basic covering algorithm, DAIRY, that learns unordered rule sets, and present two extensions that encourage the rule learne...
Hybrid Adaptive Educational Hypermedia Recommender Accommodating User’s Learning Style and Web Page Features
Personalized recommenders have proved to be of use as a solution to reduce the information overload problem. Especially in Adaptive Hypermedia System, a recommender is the main module that delivers suitable learning objects to learners. Recommenders suffer from the cold-start and the sparsity problems. Furthermore, obtaining learner’s preferences is cumbersome. Most studies have only focused...
Publication date: 2002